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Abstract 

Describing the collective activity of neural populations is a daunting task: the 
number of possible patterns grows exponentially with the number of cells, resulting 
in practically unlimited complexity. Recent empirical studies, however, suggest a vast 
simplification in how multi-neuron spiking occurs: the activity patterns of some cir- 
cuits are nearly completely captured by pairwise interactions among neurons. Why are 
such pairwise models so successful in some instances, but insufficient in others? Here, 
we study the emergence of higher-order interactions in simple circuits with different 
architectures and inputs. We quantify the impact of higher-order interactions by com- 
paring the responses of mechanistic circuit models vs. "null" descriptions in which 
all higher-than-pairwise correlations have been accounted for by lower order statistics, 
known as pairwise maximum entropy models. 

We find that bimodal input signals produce larger deviations from pairwise predic- 
tions than unimodal inputs for circuits with local and global connectivity. Moreover, 
recurrent coupling can accentuate these deviations, if coupling strengths are neither 
too weak nor too strong. A circuit model based on intracellular recordings from ON 
parasol retinal ganglion cells shows that a broad range of light signals induce unimodal 
inputs to spike generators, and that coupling strengths produce weak effects on higher- 
order interactions. This provides a novel explanation for the success of pairwise models 
in this system. Overall, our findings identify circuit-level mechanisms that produce and 
fail to produce higher-order spiking statistics in neural ensembles. 



Author Summary 

Neural populations can, in principle, produce an enormous number of distinct multi-cell 
patterns — a number so large that the frequency of these patterns could never be mea- 
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sured experimentally. Remarkably, the activity of many circuits is well captured by simpler 
probability models that rely only on the activity of single neurons and neuron pairs. These 
pairwise models remove higher-order interactions among groups of more than two cells in a 
principled way. Pairwise models succeed even in cases where circuit architecture and input 
signals seem likely to create a more complex set of outputs. We develop a general approach 
to understanding which network architectures and input signals will lead such models to 
succeed, and which will lead them to fail. 

As a specific application, we consider the remarkable empirical success of pairwise mod- 
els in capturing the activity of a class of retinal ganglion cells — the output cells of the 
retina. Our theory provides a direct explanation for these findings based on the filtering 
and spike generation properties of ON parasol retinal circuitry, in which collective activity 
arises through common feedforward inputs to spiking cells together with relatively weak gap 
junction coupling. Specifically, filtering of light input upstream of ON parasol cells shapes 
inputs to these cells in such a way that when they are processed by parasol cells, the output 
spiking patterns are closely fit by a pairwise model. 



Introduction 

Information in neural circuits is often encoded in the activity of large, highly interconnected 
neural populations. The combinatoric explosion of possible responses of such circuits poses 
major conceptual, experimental, and computational challenges. How much of this potential 
complexity is realized? What do statistical regularities in population responses tell us about 
circuit architecture? Can simple circuit models with limited interactions among cells capture 
the relevant information content? These questions are central to our understanding of neural 
coding and decoding. 

Two developments have advanced studies of synchronous activity in recent years. First, 
new experimental techniques provide access to responses from the large groups of neurons 
necessary to adequately sample synchronous activity patterns [l]. Second, maximum en- 
tropy approaches from statistical physics have provided a powerful approach to distinguish 
genuinely higher-order synchrony (interactions) from that explainable by pairwise statistical 
interactions among neurons (2]-[4). These approaches have produced diverse findings. In some 
instances, activity of neural populations is extremely well described by pairwise interactions 
alone, so that pairwise maximum entropy models provide a nearly complete description |5,6|. 
In other cases, while pairwise models bring major improvements over independent descrip- 
tions, it is not clear that they fully capture the data [3[|7|-[l2]. Empirical studies indicate that 



pairwise models can fail to explain the responses of spatially localized triplets of cells [H , 13 



as well as the activity of populations of ^100 cells responding to natural stimuli 13 . Over- 
all, the range of empirical results highlights the need to understand the network and input 
features that control the statistical complexity of synchronous activity patterns. 

Several themes have emerged from efforts to link the correlation structure of spiking activ- 
ity to circuit mechanisms using generalized 14-17] and biologically-based models [3,18,19 



Two findings are particularly relevant for the present study. First, thresholding nonlinearities 
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in circuits with Gaussian input signals can generate correlations that cannot be explained 
by pairwise statistics |14|; the deviations from pairwise predictions are modest at moder- 
ate population sizes [16|, but may become severe as population size TV grows large [14,20 



Cluster sizes (i.e., the number of cells firing simultaneously), in particular, may be poorly 
fit by pairwise models. For large population sizes, a widespread distribution of cluster sizes 
requires interactions of all orders |14|, although for population sizes explored in experimental 



data third or fourth order interactions may be sufficient 10 . Poor fitting of cluster sizes by 



the pairwise model is also noted in networks of recurrent integrate-and-fire units with adapt- 
ing thresholds and refractory potassium currents [19). Small groups of cells that perform 
logical operations can be shown to generate higher-order interactions by introducing noisy 
processes with synergistic effects (41 , but it is unclear what neural mechanisms might produce 
similar distributions. These diverse findings point to the important role that circuit features 
and mechanisms — input statistics, input/output relationships, and circuit connectivity — 
may play in regulating higher-order interactions. Nevertheless, we still lack a systematic 
understanding that links these features and their combinations to the success and failure of 
pairwise statistical models. 

Second, perturbation approaches can explain why maximum entropy models with purely 
pairwise interactions capture circuit behavior when the population firing rate is low (i.e. the 



total number of firing events from all cells in the same small time window is small) |17[|21 ,22 
The fraction of multi-information fi] captured by the pairwise model, a common metric of 
success, is necessarily low in this regime [l7| . In this regime, as well, higher-order interactions 
cannot be introduced as an artifact of under-sampling the network [22]. The studies which 
find that pairwise models are successful in capturing multivariate spiking data, however, 
include cases which either extend beyond (5|[6), or are marginally within |7|, the low spikes 
per time bin regime. Nor do perturbation results necessarily account for the success of models 
where pairwise connections are restricted to nearest-neighbor connections |5|. Therefore, an 
explanation for the empirical successes of pairwise models remains incomplete. 

Here, we aim to bridge some of the gaps between our understanding of mechanistic and 
statistical models. Our strategy is to systematically characterize the ability of pairwise max- 
imum entropy (PME) models to capture the responses of several simple circuits (see Figure 
[I]). In each case, we search exhaustively over the entire parameter space for networks with 
a simple thresholding model of spike generation. Studies of feedforward circuits reveal that 
the success of PME models does not always bear a simple relation to network architecture 
— we find examples in which networks with local connections can deviate substantially from 
PME predictions, while networks with global connections can be well approximated by PME 
models. A more consistent determinant of the success of PME models is the unimodal vs. 
bimodal profile of inputs to the circuit. In addition to departures driven by these inputs, 
we show that excitatory recurrent coupling can increase departures from PME models by a 
further 3-5 fold. 

We apply these general results to responses of specific networks based on measured prop- 
erties of primate ON parasol ganglion cells. We find that these responses are closely ap- 
proximated by PME models for a wide range of input types. PME models are successful in 
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this case because the temporal filtering properties of parasol cells induce unimodal synaptic 
inputs for a broad range of light inputs and because the measured coupling strengths are 
insufficient to produce large effects on higher-order interactions. This provides insight into 
why the measured activity patterns in these cells are well captured by PME models [5j[6j . 



Results 

Our aim is to determine how network architecture and input statistics influence the success 
of PME models. We start by considering networks of three cells, which both allow us to 
make substantial analytical progress and to visualize the results geometrically. We then 
extend these results in two ways: (1) to models based on the measured properties of primate 
ganglion cells, one system in which PME models have been very successful (5||6j; and (2) to 
larger networks. 



A geometric approach to identifying higher-order interactions among 
triplets of cells 

One strategy to identify higher-order interactions is to compare multi-neuron spike data 
against a description in which any higher-order interactions have been removed in a principled 
way — that is, a description in which all higher-order correlations are completely described by 
lower-order statistics. Such a description may be given by a maximum entropy model (2,23 



[24] , which determines how much of the potential complexity of response patterns produced by 
large neural populations can be captured by a given set of constraints. The idea is to identify 
the most unstructured, or maximum entropy, distribution consistent with the constraints. 
Comparing the predicted and measured probabilities of different responses tests whether 
the constraints used are sufficient to explain the network activity, or whether additional 
constraints need to be considered. Such additional constraints would produce additional 
structure in the predicted response distribution, and hence lower the entropy. 

A common approach is to limit the constraints to a given statistical order — for example, 
to consider only the first and second moments of the distributions, which are determined by 
the mean and pairwise interactions. In the context of spiking neurons, we denote \±i = E[x^] 
as the firing rate of neuron i and p^ = E[x^Xj] as the joint probability that neurons i and j 
will fire. The distribution with the largest entropy for a given pi and p^ is often referred to 
as the pairwise maximum entropy (PME) model. The problem is made simpler if we consider 
only permutation-symmetric spiking patterns, in which the firing rate and correlation do not 
depend on the identity of the cells; i.e. pi = p, = p for i ^ j. Thus the PME problem is 
to identify the distribution that maximizes the response entropy given the constraints p and 
p. In this section, we review some results that help provide a geometric, and hence visual, 
approach to this problem (see, for example |19|). 

We consider a permutation-symmetric network of three cells with binary responses. We 
assume that the response is stationary and uncorrelated in time. From symmetry, the pos- 
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sible network responses are 

p = P(0,0,0) 

pi = P(1,0,0) = P(0,1,0) = P(0,0,1) 

p 2 = P(1,1,0) = P(1,0,1) = P(0,1,1) 

P3 = P(l,l,l), 

where pi denotes the probability that a particular set of i cells spike and the remaining 3 — i 
do not. Possible values of (^0,^1,^2,^3) are constrained by the fact that P is a probability 
distribution, meaning that the sum of pi over all eight states is one. We will rearrange these 
response probabilities to define a more convenient coordinate system below. 

Possible solutions to the PME problem take the form of exponential functions character- 
ized by two parameters, Ai and A 2 , which serve as Lagrange multipliers for the constraints, 

P(x 1 ,x 2 ,x 3 ) = ^exp[Ai(xi + x 2 + x 3 ) + X 2 (x 1 x 2 + x 2 x 3 + Xix 3 )]. (1) 

The factor Z normalizes P to be a probability distribution. We can combine the individual 
probabilities of events 

1 

Po = Z 

Pi = — exp(Ai) 

P2 = ^ exp(2Ai + A 2 ) 

P3 = \ exp(3Ai + 3A 2 ) 



(2) 



to yield the equation 

This is equivalent to the condition that the strain measure defined in [25] be zero (in particu- 
lar, the strain is negative whenever p 3 / Po — (p 2 / Pi) 3 < 0, a condition identified in [25] as cor- 
responding to sparsity in the neural code). Equation [2] defines, implicitly, a two-dimensional 
surface in the three-dimensional space of possible probability distributions which we call the 
maximum entropy surface. The family of distributions consistent with a given /i and p forms 
a line (the iso-moment line) in this space [19] . The PME fit is the intersection of this line 
with the surface defined by Equation [2} 

This geometrical description of the PME problem takes a particularly simple form in an 
alternative coordinate space: 

f P = Ps+Po 

fi P = (3) 
P3+P0 



flm 



P2 
P2+P1 
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This set of coordinates separates network responses based on whether they are "pure" (all 
cells either spike, or do not) or "mixed" (only a subset of cells spike). f p is the fraction of 
observed responses that are pure; f\ p is the fraction of pure responses with more cells spiking 
than not {jp% vs. po). fi m is the fraction of mixed responses with more cells firing than not 
(p 2 vs. pi). Possible probability distributions are contained within a cube in this coordinate 
space: < f p) fi p) fi m < 1- The iso-moment line for a given \i and p is still a line in this 
coordinate space (see Methods), and the PME approximation is given by the intersection of 
this line with the maximum entropy surface. 

The convenience of this coordinate system is apparent when the maximum entropy con- 
straint (Equation [2]) is rewritten: 

flp = i-sfiT+sfL' (4) 

This surface is independent of f p — i.e. the maximum entropy surface forms a curve when 
projected into the /i m )-plane. In addition, each iso-moment line lies in a constant f p 
plane; the distance of an observed distribution P from the surface is thus easily visualized. 
This geometric view of the relation between the activity of a given network and its PME fit is 
particularly useful in visualizing results from exhaustive searches across network parameters, 
as described below. 

We use the Kullback-Leibler divergence, D KL {P ) P) ) to quantify the accuracy of the 
PME approximation P to a distribution P. This measure has a natural interpretation as 
the contribution of higher-order interactions to the response entropy S(P) (2||4), and may in 
this context be written as the difference of entropies S(P) — S(P). In addition, D KL (P, P) 
is approximately — log 2 L ) where L is the average likelihood — that is, the average (i.e. per 
observation) relative likelihood that a sequence of data drawn from the distribution P was 



instead drawn from the model P [5,26 . For example, if D KL = 1, the average likelihood 
that a single sample from P — i.e. a single network response — in fact came from P is 
2 _1 (we use the base 2 logarithm in our definition of the Kullback-Leibler divergence, so all 
numerical values are in units of bits). 

An alternative measure of the quality of the pairwise model comes from normalizing 
Dkl{P)P) by the corresponding distance of the distribution P from an independent max- 
imum entropy fit Dkl(P, P\), where Pi is the highest entropy distribution consistent with 
the mean firing rates of the cells (equivalently, the independent model P\ is given by the 

use 



product of single-cell marginal firing probabilities) \2\. Many studies [^j?,!? 

a P^ind Dpair 



D 



ind 



d kl (p,PiY {) 



where following 17 we define D in d = D KL (P, Pi) and D pair = D KL (P, P). A value of A = 1 
(100%) indicates that the pairwise model perfectly captures the additional information left 
out of the independent model, while a value of A = indicates that the pairwise model gives 
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no improvement over the independent model. Because we are interested in cases where the 
pairwise model fails to capture circuit outputs, at first it seems we would wish to identify 
circuits which produce a low A. However, in the circuits we explore, we find that the lowest 
values of A are achieved for nearly independent - and therefore "uninteresting" - spike 
patterns. Therefore, we have chosen to use D KL {P ) P) as our primary measure of the quality 
of the pairwise model, while reporting values of A in parallel. 

To get an intuitive picture of D KL (P,P) throughout the cube of possible distributions 
P, we view this quantity along constant-/^ slices in Figure [2J Dkl(P, P) increases with 
distance from the constraint curve (Equation [4]); along the iso-moment line for a given 
D KL (P,P) is convex with a minimum of zero at P = P (detailed calculations are 
given in Materials and Methods). Therefore, for any choice of /1 and p, D KL {P ) P) increases 
monotonically as a function of the distance along an iso-moment line. The distance, which is 
easily visualized, thus gives an indication of how close P comes to being a pairwise maximum 
entropy distribution. The observed distribution with the maximal deviation from its pairwise 
maximum entropy approximation will occur at one of the two points where the iso-moment 
line reaches the boundary of the cube. The global maximum of D KL (P,P) will, therefore, 
also occur on the boundary. 

To assess the numerical significance of Dkl{P, P), we can compare it with the maximal 
achievable value for any symmetric distribution on three spiking cells. For three cells, this 
value is 1 (or 1/3 bits per neuron), achieved by the XOR operation E] — i.e. the average 
probability that a single output of the XOR circuit came instead from the PME approxima- 
tion is 2 _1 . This distribution, along with its position in the / lm ) coordinate space, is 
illustrated in Figure [2] (f p = 0.25 slice) and the right column of Figure [3j We will find that 
distributions produced by permutation-symmetric networks fall far short of this value. 

In summary, we have shown that identifying high-order interactions in the joint firing 
patterns of three cells is equivalent to showing that spiking probabilities lie a substantial 
distance from a constraint surface that is easy to visualize. Given this geometric description 
of the problem, we next consider how the distance from the constraint surface depends on 
circuit connectivity, nonlinear properties of the individual circuit elements, and the statistics 
of the input signals (Figure [I]). 



When do triplet inputs produce higher-order interactions in spike 
outputs? 

We first considered a simple feedforward circuit in which three spiking cells sum and threshold 
their inputs. Each cell j received an independent input Ij and a "triplet" — or global — 
input I c that is shared among all three cells. Comparison of the total input Sj = I c + Ij with 
a threshold determined whether or not the cell spiked in that time bin. The nonlinear 



threshold can produce substantial differences between input and output correlations [27-31 
An additional parameter, c, identified the fraction of the total input variance a 2 originating 
from the global input; that is, c = Var[/ C ]/Var[/ C + Ij]. 

What types of inputs might be natural to consider? The constraint surface (Equations 
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[2] and [4]) can give us some intuition into what types of inputs will be more or less likely to 
produce spiking responses that will deviate from the pairwise model. For example, suppose 
that I c can take on values that cluster around two separated values, \ia < /j>b, but that it 
is very unlikely that I c ever takes on values in the interval between; that is, the distribution 
of I c is bimodal If /x^ is large enough to push the cells over threshold but \ia is not, then 
we see that any contribution to the right-hand side of Equation [2j P2/P1, depends only 
on the distribution of the independent inputs Ij] if either one or two cells spike, then the 
common input must have been drawn from the cluster of values around /x^, because otherwise 
all three cells would have spiked. Keeping the values of fi a and the same, therefore, 
but changing the relative likelihood of drawing the common input from one cluster or the 
other, would change the ratio ps/po without changing the ratio p^lpx- Hence the constraint 
specifying those network responses exactly describable by PME models can be violated when 
the common input is bimodal. 

In contrast, we may instead consider a unimodal input — one whose distribution has a 
single peak or range of most likely values — of which a Gaussian input is a natural example. 
Here, the distribution of the common input I c is completely described by its mean and 
variance; both parameters can impact the ratio p^/po (by altering the likelihood that the 
common input alone can trigger spikes) and the ratio p2/pi- Each value of I c is consistent 
with both events p\ and p2 : with the relative likelihood of each event depending on the 
specific value of 7 C ; it is no longer clear how to separate the two. 

Motivated by these observations, we choose our inputs to come from either one of three 
types of unimodal distributions — Gaussian, skewed, or uniform — or a bimodal distribution. 
The one-sided skewed shape, in particular, is chosen to mimic qualitative features of inputs to 
retinal ganglion cells (e.g. |32| and below). We indeed find both qualitative and quantitative 
differences in the capacity of each class of inputs to generate deviations from the pairwise 
model. 



Unimodal inputs fail to produce higher-order interactions in three cell feedfor- 
ward circuits 

We first considered unimodal inputs, which were chosen from a distribution with a single 
peak (or range of most likely values). Gaussian inputs provide a natural example. If I c and 
each Ij are Gaussian, then the joint distribution of S = (Si, £2, S3) is multivariate normal, 
and therefore characterized entirely by its means and covariances. Because the PME fit to a 
continuous distribution is precisely the multivariate normal that is consistent with the first 
and second moments, every such input distribution on S exactly coincides with its PME 
fit. However, even with Gaussian inputs, outputs (which are now in the binary state space 



{0, l} 3 ) will deviate from the PME fit 14, 16 . As shown below, non-Gaussian unimodal 



inputs can produce outputs with larger deviations. Nonetheless, these deviations in all cases 
are small, and PME models are quite accurate descriptions of circuits with a broad range of 
unimodal inputs. 

In more detail, we considered a circuit of three cells with inputs I c and Ij that could be 
Gaussian, uniform, or skewed. For each type of input distribution, we probed the output 
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distribution across a range of values for c, a, and that explored "all" possible activity 
patterns. In particular, we covered a full range of firing rates (or equivalent ly spikes per 
time bin), not limited to the low firing rate regime treated in |17|. Figure [3JA.-C shows 
observed distributions for different marginal input statistics (left column). The central col- 
umn compares all observed distributions with the PME constraint curve, projected into the 
(Zip, Am)-plane. 

The right column of Figure |4|/\-C shows Dkl{P-> P) as a function of c and a for the value 
of O that maximized Dkl{P, P) (or one of them, if multiple such values exist). Different 
scales are used to emphasize the structure of the data. For the unimodal cases shown, D KL 
peaked in regions with comparatively low input variance (a < 1) and large relative strength 
of common input (c > 0.5). However, D KL {P ) P) never reached a very high numerical value 
for unimodal inputs; the maximal values achieved for Gaussian, skewed, and uniform distri- 
butions are 0.00376 (A = 0.989), 0.0152 (A = 0.999), and 0.0186 (A = 0.9428) respectively 
(compare with Figure [3]) . Thus hundreds or thousands of samples would be required to 
reliably distinguish the outputs of these networks from the PME approximation. 

Clear patterns emerged when we viewed D KL (P, P) as a function of output spiking statis- 
tics rather than input statistics. Figure [5]/\-B show the same data contained in the center 
column of Figure [4j but now plotted with respect to the output firing rate, which is the same 
for all cells. The data were segregated according to the correlation coefficient p between 
the responses of cell pairs, with lighter shades indicating increasing correlation. For a fixed 
correlation, there was generally a one-to-one relationship between firing rate and D KL {P ) P). 
For unimodal distributions (Figure [5]/\-B), D KL (P,P) showed a double-peaked relationship 
with firing rate, with larger values attained at low and high firing rates, and a minimum in 
between. Additionally, Dkl(P^ P) had a non-monotonic relationship with spike correlation: 
it increased from zero for low values of correlation, obtained a maximum for an intermediate 
value, and then decreased. These limiting behaviors agree with intuition: a spike pattern 
that is completely uncorrelated can be described by an independent distribution (a special 
case of PME model), and one that is perfectly correlated can be completely described via 
(perfect) pairwise interactions alone. 

Bimodal triplet inputs can generate higher-order interactions in three cell feed- 
forward circuits 

Having shown that a wide range of unimodal common inputs produced spike patterns that are 
well-approximated by PME fits, we next examined bimodal inputs. Figure |4p shows results 
from a simple ensemble of bimodal inputs — Bernoulli-distributed common and independent 
inputs — that produced moderate deviations from the pairwise approximation. The common 
input was "1" with probability p and "0" with probability 1 — p. The independent inputs 
were each chosen to be "1" with probability q and a 0" with probability 1 — q. The threshold 
of the cells was between 1 and 2, so that spiking required both common and independent 
inputs to be active. The space of possible spiking distributions was explored by varying p 
and q. 

This circuit produced response distributions that deviated modestly from PME fits, and 
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these distributions preferentially lie on one side of the constraint curve (Figure [4]D, center). 
The largest values of Dkl{P-> P) occurred where moderate correlated input is coupled with 
strong background input (q > 0.5; Figure [I]D, right), and reached values that are five times 
higher than those found for a unimodal distribution (the maximal value achieved is 0.091). 
The location of the distribution generating this maximum value is demonstrated in the center 
column of Figure [3| 

Both of these observations can be explained by direct calculation of the spiking probabili- 
ties. Substituting the probabilities of different events — Po = l—p+p(l — q) 3 , p\ = pq^ — q) 2 , 
P2 = PQ 2 (^ — q) and ps = pq 3 — into the constraint surface specifying PME models (Equation 
[2| and dividing by g 3 , we can write 

(6) 



1 — p + p(l — q) 3 (1 — q) 3 

which gives us an intuition for how to violate the constraint; for a fixed g, we manipulate 
the left-hand side by changing p. 

Another way to view this is by observing that the right hand side of Equation [2] can be 
written without reference to the probability of common input; because P[l spike | I c — 0] = 
and P [2 spikes | I c = 0] = 0, one may write 

p 2 _ P[2 spikes | I c = 1]P[I C = 1] 

(7) 



pi P[lspike|/ c = l]P[/ c = l] 

P [2 spikes | I c = 1] 



P[l spike I I c = 1] 

which has no dependence on the statistics of the common input. So the left hand side of the 
constraint equation (Eqn. [2]) can be manipulated by shifting p, without making any changes 
to the right hand side. 

In Figure [5]C, we again present values of D KL (P,P) as a function of the firing rate 
and pairwise correlation elicited by the full range of possible bimodal inputs. We see that 
Dkl{P) P) is maximized at a single, intermediate firing rate, and for correlation values near 
0.6. 

We find distinctly different patterns when we view A (i.e. D KL normalized by the distance 
of P from an independent maximum entropy fit, Equation [5]) , for these same simulations, as 
a function of output spiking statistics (Figure [5]D-F). For unimodal distributions (Figure [5]D- 
E), A is very close to 1, with the few exceptions at extreme firing rates. For bimodal inputs 
(Figure[5]F), A may be appreciably far from 1 — as small as 0.5 — with the smallest numbers 
(suggesting a poor fit of the pairwise model) occurring for low correlation p. This highlights 
one interesting example where these two metrics for judging the quality of the pairwise 
model, Dkl{P->P) and A, yield contrasting results. 

An analytical explanation for unimodal vs. bimodal effects 

We next develop analytical support for the distinct impact of unimodal vs. bimodal inputs 
on fits by the PME model. Specifically, we calculate Dkl{P, P) for small deviations from the 
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PME constraint surface of Equation [2| We summarize the results of this calculation here; 
details are in Materials and Methods. We considered narrow distributions of common input 
I C) with a small parameter c indicating their variance. By approximating the distribution 
of network outputs by a Taylor series in c, we found that the leading order behavior of 
Dkl{P, P) depended on c 3 for unimodal distributions — i.e the low order terms in c dropped 
out (for symmetric distributions, such as Gaussian, the growth was even smaller: c 4 ). For 
bimodal distributions, on the other hand, the leading order term of D KL (P,P) grew like 
c 2 . The key point is that, as the strength of common input signals increased, circuits with 
bimodal inputs diverged from the PME fit much more rapidly than circuits with unimodal 
inputs. 

When do pairwise inputs produce higher-order interactions in spike 
outputs? 

In the previous sections, we considered permutation-symmetric distributions generated by 
a single, global, common input. Another class of permutation-symmetric distributions can 
be generated when common inputs are shared pairwise — i.e. by two cells but not three at 
once. We now show that significant departures from the pairwise maximum entropy model 
(PME) can be generated with pairwise bimodal inputs. This provides a specific example in 
which network architecture and output statistics are not in simple correspondence. 

Our circuit setup included three cells, each of which received and summed two inputs 
and spiked if this sum exceeded a threshold. We denote the inputs 7 12 , 723? hs so that cell 2 
received 7 12 and 7 2 3, and so forth, as illustrated in Figure [TJ Each input was chosen from a 
binary distribution with parameters m and r so that P[iy = m] = r and P[hj = 0] = 1 — r. 
Without loss of generality, we chose m = 1 and chose the threshold such that 1 < < 2. 
Therefore, both pairwise inputs to a cell must be active in order for a cell to fire. It is not 
possible in this circuit for precisely two cells to fire; for two cells to fire (say cell 1 and cell 
2), both inputs to each cell must be active. However, this implies that both inputs to cell 
3 (ii3 and I 2 s) are active as well. If two cells fire, then the third must fire as well; that is, 
P2 = 0. 

The remaining probabilities are easily computed by itemizing and computing the proba- 
bilities for each event and are as follows: p% = r 3 , p\ = r 2 (l — r), andp = 3r(l — r) 2 + (l — r) 3 . 
This distribution has a unique PME fit consistent with both the first and second moments. 
However, the PME fit can be far from the actual distribution; we found that D KL (P, P) de- 
pended on the input rate r and could exceed 0.5. Thus observation of the network response 
(single draw from P) would, on average, have a likelihood ratio of 0.7 of coming from P 
versus P, and observation of 15 network responses would have a likelihood ratio of less than 
0.01. 

We will revisit the contrast between global and pairwise common inputs below; we also 
refer to the pairwise case as local inputs. 
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Recurrent coupling produces modest effects on higher-order inter- 
actions 

Neural circuits are rarely purely feedforward, and recurrent connectivity can play an impor- 
tant role in shaping the resulting activity. Therefore, we next modify our thresholding model 
to incorporate the effects of recurrent coupling among the spiking cells. We take all-to-all 
coupling among our N = 3 cells, and assume that this coupling is rapid compared with 
the timescale over which inputs arrive at the cells, so that coupling has its full impact on 
the spike events that are recorded in a single realization of the model. We limit our study 
to excitatory interactions, anticipating an application below to the effects of gap junctions 
known to mediate recurrent coupling between retinal ganglion cells. 

In more detail, our coupling model may be described as follows: first, inputs arrive at 
each cell as for the cases without coupling. If the inputs elicit any spikes, there is a second 
stage in which the input to each neuron receiving a connection from a spiking cell is increased 
by an amount g. This represents a rapid depolarizing current, assumed for simplicity to add 
linearly to the input currents. If the second stage results in additional spikes, the process is 
repeated: recipient cells receive an additional current g, and their summed inputs are again 
thresholded. The sequence terminates when no new spikes occurred on a given stage; e.g., 
for TV = 3, there are a maximum of three stages. The spike pattern that is recorded on a 
given trial is the total number of spikes fired across all of the stages. 

For the simplest case of globally structured inputs and all-to-all coupling, the probability 
of each spike pattern (or count) can be written down analytically, by evaluating the proba- 
bilities along the tree of possible events that unfold through the stages just described. For 
other cases, these trees are more complex, and we instead evaluate probabilities via Monte- 
Carlo sampling (unimodal inputs), or through enumeration of a finite number of possible 
input states (bimodal inputs). 

We begin by describing our results for globally structured, unimodal inputs. We range 
over parameters c, a, and as for the uncoupled cases above, and additionally vary g over a 
wide range of values, from g = to g = 2.4 (comparable to the maximum threshold value). 
Our basic finding is that coupling has only modest effects on D KL values, when compared 
with the theoretical range of values that could be obtained. Specifically, the largest value 
found over all parameters tested was D KL = 0.0099 for the case of Gaussian inputs, 0.108 
for uniform, and 0.0599 for skewed. All of these values are roughly 3 — 5x values obtained 
for the uncoupled cases, and, for Gaussian and skewed inputs, remain much smaller than 
values found for bimodal inputs. The largest value achieved over any unimodal input and for 
any coupling strength (0.108) is comparable to the largest value found with bimodal inputs 
without coupling. 

Figure [6]A shows the data for Gaussian inputs as a scatter plot of D KL values vs. coupling 
strengths g ) showing that intermediate values of coupling — for which the size of coupling 
terms is about the same as the standard deviation of the inputs — have the greatest ability to 
drive departures from the PME model. Moreover, when coupling strength was varied while 
keeping other parameters fixed, Dkl was also generally maximized at intermediate values of 
the coupling. In Figure pp, we show D KL (P,P) vs. coupling strength g ) for representative 
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values of the circuit parameters c, a and 0. For each case of global Gaussian, skewed, 
uniform, and bimodal inputs (thin lines), we see a peak in Dkl{P->P) at medium values of 
9> 

We next introduce coupling to circuits with pairwise, locally structured unimodal inputs 
(as in the preceding section). Here, we range over parameters a and (c is fixed at 0.5), and 
additionally vary g over the same wide range of values above. For local inputs, the maximum 
value of D KL found could be increased modestly by including coupling (from 0.0013 to 0.0086 
with Gaussian inputs, from 0.0150 to 0.0157 with skewed inputs, and from 0.0233 to 0.0547 
with uniform inputs), but remains substantially below the values found for global inputs with 
coupling. As in circuits with global inputs, D KL was generally maximized at intermediate 
values of the coupling; this is illustrated for uniform inputs in the third panel of |6]B (note 
that global and local inputs are identical for Gaussian inputs). An exception was observed 
for skewed inputs, illustrated in the top panel of[6]B, where D KL shows a maximum at g = 0. 

Finally, we introduce coupling to circuits with bimodal inputs. We note that because 
the inputs can only take on a finite number of configurations, with all other parameters 
fixed, the output distribution of the network can only change at a few discrete values of g. 
Moreover, large values of g induce "all-or-none" firing behavior, which is perfectly matched 
by pairwise maximum entropy. For global inputs, we find modest increases in Dkl which 
are maximized at an intermediate value of the coupling; for pairwise inputs, the output 
distribution attains its maximum level of higher order interactions at g = and decreases 
only when g is sufficiently strong enough to create an "all-or-none" firing pattern (bottom 
panel, Fig. [6]B). 

In summary, both input statistics and connectivity shape the development of higher- 
order interactions in our basic circuit model. Other parameters being equal, bimodal inputs 
can generate a larger D KL (P,P) than unimodal inputs. For a particular choice of input 
marginals, global inputs can generate greater deviations than purely pairwise inputs (with 
the exception of one case, that locally structured bimodal inputs). However, the goodness of 
the PME fit alone does not distinguish between global and pairwise anatomical projections. 
Adding fast recurrent excitation increases the accessible magnitude of higher-order interac- 
tions; for a fixed 0, the range of accessible D KL increases — and then decreases — with 
the magnitude of recurrent excitation. With these principles in hand, we now study a more 
biologically realistic network model. 

An experimentally constrained model for correlated firing in retinal 
ganglion cells 

PME approaches have been effective in capturing the activity of small retinal ganglion cell 
(RGC) populations (5||7|. This success does not have an obvious anatomical correlate — i.e. 
there are multiple opportunities in the retinal circuitry for interactions among three or more 
ganglion cells. Why do these apparently fail to generate higher-order interactions in output 
spiking data? To answer this question, we explored the properties of circuits composed of 
cells with input statistics, recurrent connectivity and spike-generating mechanisms based 
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directly on experiment. We based our model on ON parasol RGCs, one of the RGC types 
for which PME approaches have been applied extensively (5j[6). We first describe the RGC 
model, then apply this to feedforward circuits, and finally consider recurrent connections. 



RGC model 

We modeled a single ON parasol RGC in two stages (for details see Materials and Methods). 
First, we characterized the light-dependent excitatory and inhibitory synaptic inputs to cell 
k (g^ c (t)^ fi r ^ nh (t)) in response to randomly fluctuating light inputs s(t) via a linear-nonlinear 
model, e.g.: 

gr(t) = iv-[L exc * Sfc (t)+C c ], (8) 

where N e * c is a static nonlinearity, L exc is a linear filter, and r]^ is an effective input noise 
that captures variability in the response to repetitions of the same time- varying stimulus. 
These parameters were determined from fits to experimental data collected under condi- 
tions similar to those in which PME models have been tested empirically. The modeled 
excitatory and inhibitory conductances captured many of the statistical features of the real 
conductances, particularly the correlation time and skewness. 

Second, we used Equation [8] and an equivalent expression for g^ h (t) as inputs to an 
integrate-and-fire model incorporating a nonlinear voltage and history-dependent term to 



account for refractory interactions between spikes 33 . The voltage evolution equation was 
of the form 

^ = F(V,t-t last ) + 1 -^-, (9) 
where F(V, t — ti ast ) was allowed to depend on the time of the last spike ti ast (more details 



in Methods). We fit the parameters of this model to a dynamic clamp experiment 34 ,[35] in 
which currents corresponding to g exc (t) and g mh (t) were injected into a cell and the resulting 
voltage response measured. Excitatory and inhibitory synaptic currents injected during 
one time step were determined by scaling the conductances by driving forces based on the 
measured voltage in the previous time step. Recurrent connections were implemented by 
adding an input current proportional to the voltage difference between the two coupled cells. 

The prescription above provided a flexible model that we used to study the responses of 
small RGC networks to a wide range of light inputs and circuit connectivities. Specifically, 
we simulated RGC responses to light stimuli that were (1) constant, (2) time- varying and 
spatially uniform, and (3) varying in both space and time. Correlations between cell inputs 



arose from shared stimuli, from shared noise originating in the retinal circuitry 32 , or 



from recurrent connections [32, 36 . Shared stimuli were described by correlations among 
Sjfe. Shared noise arose via correlations in rjk as described in Materials and Methods. The 
recurrent connections were chosen to be consistent with observed gap-junctional coupling 
between ON parasol cells. We also investigated how stimulus filtering by L exc and L mh 
influenced network statistics. 
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For each network model, we determined whether the accuracy of a PME fit to the outputs 
was predictable based on the structure of the input distributions, using the results developed 
above for the idealized circuit model. We focused on excitatory conductances because they 
exhibit stronger correlations than inhibitory conductances in ON parasol RGCs |32|. To 
compare our results with empirical studies, constant light and spatially and temporally 
fluctuating checkerboard stimuli were used as in (5|[6). 

Feedforward RGC circuits 

We start by considering networks without recurrent connectivity and without temporal mod- 
ulations in the light input. Thus we set Sfc(t) = for k = 1,2,3, so that the cells received 
only Gaussian correlated noise 7^ xc and rf£ h and constant excitatory and inhibitory conduc- 
tances. Time-dependent conductances were generated and used as inputs to a simulation of 
three model RGCs. Simulation length was sufficient to ensure significance of all reported 
deviations from PME fits (see Materials and Methods). Under these conditions the excita- 
tory conductances were unimodal and broadly Gaussian. As expected from earlier results 
on threshold models, the spiking distributions were well-modeled by a PME fit, as shown 
in Figure [TJa.; D kl (P,P) is 2.90 x 10 -5 bits. This agrees with the very good fits found 
experimentally in |5| under constant light stimulation. 

For full-field simulations, each cell received the same stimulus, Sfc(t) = s(t), where s(t) 
refreshes every few milliseconds with an independently chosen value from one of several 
marginal distributions. The shared stimulus produced strong pairwise correlation between 
conductances of neighboring cells. However, results from our threshold model (Fig.ji]) suggest 
that this is not the overall determining factor in whether or not spiking outputs will be 
well-modeled by a PME fit; rather, the shape of the marginal distribution of inputs (here, 
conductances) should be more important than the strength of pairwise correlations. 

We first examined the effects of different marginal statistics of light stimuli, standard 
deviation of full-field flicker, and refresh rate on the marginal distributions of excitatory 
conductances. For a short refresh rate (8 ms) and small flicker variance (1/6 or 1/4 of 
baseline light intensity) , temporal averaging via the filter L exc and the approximately linear 
form of 7V exc over these light intensities produced a unimodal, modestly skewed distribution 
of excitatory conductances, regardless of whether the flicker is drawn from a Gaussian or 
binary distribution (see Figure [7]B-C, center panels). For a slower refresh rate (100 ms) 
and large flicker variance (1/3 or 1/2 of baseline light intensity), excitatory conductances 
had multi-modal and skewed features, again regardless of whether the flicker is drawn from 
a Gaussian or binary distribution (Figure [7|D). Other parameters being equal, binary light 
input produced more skewed conductances. While some conductance distributions had mul- 
tiple local maxima, these were never well-separated, with the envelope of the distribution 
still resembling a skewed distribution. 

As expected from our studies with the simple thresholding model of spike generation, 
the largely unimodal shape of input distributions was reflected in the ability of PME fits 
to accurately capture spiking distributions. D KL (P,P) values computed from the observed 
distributions were small, never exceeding 0.0067; these are numbers comparable to what is 



15 



achievable by skewed inputs in our simple thresholding circuit model. To test the sensitivity 
of this conclusion to the finite sampling in our simulations, we performed an analysis in which 
the data was divided into 20 subsets and the maximum entropy analysis was performed indi- 
vidually on each subset. The resulting KL-distance remained small, never exceeding 0.0089. 
In summary, even high-contrast, bimodal, highly spatially correlated stimulus variations do 
not produce a large departure from the PME fit. 

When we examined all of the spiking distributions produced in this sequence of simula- 
tions, we found a common pattern in the way in which the PME fit deviated from observed 
distributions. Single spiking events were over-predicted by PME fits, whereas double spiking 
events were under-predicted. We note that this is the same situation observed in our simple 
threshold model with bimodal global inputs (see Figure 3 and Materials and Methods), and 
corresponds to the case of negative strain identified by Ohiorhenuan et al. |25|. This find- 
ing is extremely robust; upon perturbing the distributions by estimated standard errors, as 
described in Materials and Methods, only 22 out of 840 perturbed distributions showed a 
positive strain while the remainder had negative strain. 

Overall, high-pass filtering — a consequence of the differentiating linear filter in Equation 
18] and illustrated in Figure [7p — was responsible for significantly reducing the bimodality 



of the input stimuli. In Figure [7]E, we produce a filter without the biphasic, high-pass 
shape of Equation 18 (i.e., without the negative dip at longer time lags) by simply rectifying 
the experimentally determined filter. This filter produced conductance distributions that 
more completely reflect the bimodal shape of binary light inputs (Figure [^E, center panel). 
The resulting simulation produces six-times greater Dk^P^P) (Figure^7|E, right panel). 
This raises the intriguing suggestion that greater Dkl{P-> P) may occur for other cell types 
primarily characterized via monophasic filters (such as midget cells), or for different light 
stimuli (e.g., at different mean levels) for which the retinal circuit acts to primarily integrate 
rather than differentiate over time. 

In Figure [8]/\, we examine this effect over all full-field stimulus conditions by plotting 
deviations from the pairwise model, in terms of D KL {P ) P) ) in simulations with a rectified 
filter against the same quantity with a non-rectified filter. An increase in Dkl{P) P) was 
observed across stimulus conditions, with a markedly higher effect for longer refresh rates. 
Large effects were accompanied by a striking increase in the bi- or multi-modality of ex- 
citatory conductances (see Figure [8]B, illustrating binary stimulus at 100 ms refresh rate 
and standard deviation of 1/4), while small effects were accompanied by small or no dis- 
cernible change in the marginal distribution of excitatory synaptic conductances (see Figure 
8]C, illustrating binary stimulus at 8 ms refresh rate and standard deviation of 1/2). This 
consistent change could not be attributed to changes in lower order statistics; there was no 
consistent relationship between the change in pairwise model performance and either firing 
rate or pairwise correlations (data not shown). 

Moving beyond full field light stimuli, we asked whether pairwise maximum entropy 
models capture RGC responses to stimuli with varying spatial scales. We fixed stimulus 
dynamics to match the two cases that yielded the highest D KL (P,P) under the full field 
protocol: for both Gaussian and binary stimuli, this was an 8 ms refresh rate and a = 1/2. 
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The stimulus was generated as a random checkerboard with squares of variable size; each 
square in the checkerboard, or stixel, was drawn independently from the appropriate marginal 
distribution and updated at the corresponding refresh rate. The conductance input to each 
RGC was then given by convolving the light stimulus with its receptive field, where the 
stimulus is positioned with a fixed rotation and translation relative to the receptive fields. 
This position was drawn randomly at the beginning of each simulation and held constant 
throughout (see Figure [9]D,G for examples, and Materials and Methods for further details). 

The RGC spike patterns remained very well described by PME models for the full range 
of spatial scales. Figure 9]A shows this by plotting D KL (P,P) vs. stixel size. Values of 
Dkl{P,P) increased with spatial scale, sharply rising beyond 128 /xm, where a stixel is 
approximately the same size as a receptive field center. The points at 512 /xm are from 
the corresponding full field simulations, illustrating that introducing spatial scale via stixels 
produces even closer fits by PME models. 

Values reported in Figure [9]A are averages of D KL (P, P) produced by 5 random stimulus 
positions. At stixel sizes of 128 fim and 256 /xm, the resulting spiking distributions differed 
significantly from position to position; in Figure [9]B, we show the probabilities of the distinct 
singlet (e.g. P(1,0,0)) and doublet (e.g. P(1,1,0)) spiking events produced at 256 fim. 
Each stimulus position creates a "cloud" of dots (identified by color); large dots show the 
average over 20 sub-simulations. Each sub-simulation is identified by a small dot of the 
same color; because the simulations are very well-resolved, most of these are contained 
within the large dots (and hence not visible here). Heterogeneity across stimulus positioning 
is indicated by the distinct positioning of differently colored dots. At smaller spatial scales, 
the process of averaging stimuli over the receptive fields results in spiking distributions that 
are largely unchanged as stimulus position changes, as shown in Figure [9p, where singlet 
and doublet spiking probabilities are plotted for 60 /xm stixels. Thus, filtered light inputs 
are largely homogeneous from cell to cell, as each receptive field samples a similar number of 
independent, statistically identical inputs: Figure |9p shows the projection of input stixels 
onto cell receptive fields from an example with 60 /xm stixels. The resulting excitatory 
conductances (Figure |9]E) and spiking patterns (Figure [9]F) are very close to cell-symmetric. 

By contrast, spiking patterns showed significant heterogeneity from cell to cell, when the 
stixel size was large; this arises because each cell in the population may be located differently 
with respect to stixel boundaries, and therefore receive a distinct pattern of input activity. 
Figure [9p shows the projection of input stixels onto cell receptive fields from one example, 
with 256 /im stixels. The difference in statistical properties of inputs is reflected in the 
marginal distributions of excitatory conductances that each cell receives, shown in Figure 
[9]H. It is also apparent in the observed output spiking patterns, where one particular doublet 
spiking pattern (here, 110) is significantly more likely to occur than the others (see Figure 
[9][). However, PME models gave excellent fits to data regardless of heterogeneity in RGC 
responses, as illustrated for this particular example in Figure [9^; over all 20 sub-simulations 
(as above), and over all individual stixel positions, we found a maximal D KL (P, P) value of 
0.00811. 
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Recurrent connectivity in the RGC circuit 



We next considered the role of recurrence in shaping higher order interactions by incorpo- 
rating gap junction coupling into our simulations. We did this separately for each full field 
stimulus condition: Gaussian and binary marginal distributions, with variances 1/16, 1/12, 
1/8, 1/6, 1/4, 1/3, or 1/2 of baseline light intensity, and refreshed at either 8, 40, or 100 ms 
intervals. In each case, we added gap junction coupling with strengths from 1 to 16 times 
an experimentally measured value [32], and compared the resulting D KL with that obtained 
without recurrent coupling. Results are shown in Figure [10} 

At the measured coupling strength (g gap = 1.1 nS) itself, the fit of the pairwise model 
barely changed (Figure [To|B). At twice the measured coupling strength (g gap = 2.2 nS), 
recurrent coupling had increased higher order interactions, as measured by larger values of 
Dkl for all tested stimulus conditions. Higher order interactions could be further increased, 
particularly for long refresh rates (100 ms), by increasing the coupling strength to 4 or 8 



times its baseline level (g gap = 8.8 nS; see Figure 10 0, D). Consistent with our intuition that 
very strong coupling leads to a all-or-none" spiking patterns, D KL (P,P) decreased as g gap 
increased further, often to a level below what was seen in the absence of coupling (Figure 

Again, we considered the possibility that firing rates could explain the increase; however, 
firing rates actually decreased with the addition of coupling. Our findings are consistent with 
our results for the threshold model (Figure [6]), in that the impact of coupling is maximized 
at intermediate values of the coupling strength. Moreover, the impact of recurrent coupling 
on the maximal values of D KL evoked by visual stimuli is small overall, and almost negligible 
for experimentally measured coupling strengths. 

Scaling of higher-order interactions with population size 

The results above identify the input statistics — particular unimodality vs bimodality — as 
a key determinant of the success of PME models. How robust is this conclusion to network 
size? The permutation-symmetric architectures we have considered can be scaled up to more 
than three cells in several natural ways; for example, we can consider N cells with a global 
common input. The pairwise (local) input structure can also be scaled up to consider N cells 
on a ring, with each pair of adjacent cells receiving a common, pairwise input (see Figure [j]). 

We first considered a sequence of models in which a set of N threshold spiking units 
received global input I c (with mean and variance a 2 c) and an independent input Ij (with 
mean and variance a 2 (l — c)). As for the three cell networks above, the output of each cell 
was determined by summing and thresholding these inputs. The probability distribution of 
network outputs was computed as described in the Methods and then fit with a pairwise 
maximum entropy distribution. As also for the three cell networks, we explored a range of a, 
c, and and recorded the maximum value of Dkl(P, P) between the observed distribution 



P and its PME fit P. Figure 11 shows this D KL /N (i.e. entropy per cell [16]) for Gaussian, 
uniform, skewed, and bimodal input distributions. 

We found that the maximum D KL {P ) P) increased roughly linearly with N for bimodal 
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inputs, and superlinearly for unimodal inputs. The relative ordering found at TV = 3 — 
that the maximal achievable Dkl(P,P) is lowest for Gaussian inputs, followed by skewed, 



uniform, and bimodal inputs consecutively — remained the same. The sidebar of Figure 11 



shows that the probability distributions produced by these inputs qualitatively agree with 
this trend: departures from PME were more visually pronounced for global bimodal inputs 
(top histogram) than for global unimodal inputs (third histogram from top). At N = 16, 
the value D KL /N ^0.1 for bimodal global inputs corresponds to a likelihood ratio of 0.33 
that a single draw from P (single network output) in fact came from the PME fit P versus 
P; a likelihood < 0.01 is reached for 4 draws. 

We next considered pairwise inputs for TV > 3 cells by adopting a ring structure with 
nearest-neighbor common inputs (illustrated in Figure[T]). For unimodal inputs, we computed 
Dkl{P,P) while varying a and 0; for bimodal inputs, we varied the probability of each 
Bernoulli input. Figure 
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shows the maximal Dxl{P) P) per neuron. Overall, circuits 
with bimodal pairwise inputs showed appreciable values that are about half of that found 
for bimodal global inputs. The relatively large deviation at TV = 3 receded, replaced by 
deviations that were similar to those seen for global, unimodal inputs. For pairwise unimodal 
inputs, values of D KL {P ) P) remained very small. 

Finally, we extended our recurrent model to TV > 3 cells. As for the three cell networks, 
we explored a range of a, c, and for unimodal inputs, or s, p, and for bimodal inputs. 
In addition, the coupling strength, g, was varied for each type of input. Coupling was either 
all-to-all (global) or applied pairwise in a ring structure (local). Figure 12 shows the maximal 
Dkl(P^P) per neuron, for each type of input and network architecture, up to population 
size N = 8. Recurrent coupling increases the available range of higher order interactions 
in most settings from the level achieved with purely feed forward connections. However, 
the maximal value of D KL (P,P) per cell is steady or decreasing with TV, suggesting that 
recurrence has less impact as population size grows. 

To summarize, the greater impact of bimodal vs. unimodal input statistics on maximal 
values of D KL (P,P) that can be obtained in a given circuit persists from TV = 3 up to 
N = 1Q (tested up to TV = 8 in the recurrent case). Moreover, agreeing with intuition, global 
inputs of a given type can generate greater deviations than purely pairwise inputs (with 
the exception of one case, N = 3). Overall (again excepting this one case), for the circuit 
parameters producing maximal deviations, it becomes easier to statistically distinguish be- 
tween spiking distributions and their PME fits as TV increases in feedforward networks. In 
contrast, the maximal attainable Dkl{P, P) per cell appears to decrease with TV in many 
recurrent networks. 



Discussion 

We use simple mechanistic models to identify which combinations of network architectures 
and input signals produce spike patterns with higher-order interactions and which do not. 
Deviations in circuit outputs from pairwise maximum entropy (PME) predictions were much 
smaller than the maximum theoretically attainable values for a general spiking pattern. 
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Moreover, output statistics were not simply related to network architecture. Nonetheless, 
several simple principles emerge that determine the strength of higher-order interactions. 
First, bimodal input distributions produced stronger higher-order interactions than unimodal 
distributions. Second, networks with shared inputs among all cells produced greater higher- 
order interactions than those with pairwise inputs (except for the case of three cell networks 
receiving bimodal input). Third, recurrent excitatory or gap junction coupling could produce 
a further, moderate increase higher-order correlations; the effect was greatest for coupling of 
intermediate strength. Our overall results held for networks with nonlinear integrate-and-fire 
units based on measured properties of retinal ganglion cells. Together with the facts that 
ON parasol cell filtering suppresses bimodality in light input, and that coupling among ON 
parasol cells is relatively weak, our findings provide an explanation for why their population 
activity is well captured by PME models. 

Comparison with empirical studies 

How do our maximum entropy fits compare with empirical studies? In terms of D KL (P, P) 
- equivalently, the logarithm of the average relative likelihood that a sequence of data drawn 
from P was instead drawn from the model P — numbers obtained from our RGC models are 
very similar to those obtained by experiments on retinal ganglion cells [5|[6]. We find that 
Dkl(P-> P) = 2.90 x 10 -5 bits under constant light conditions, compared to an experimental 
value of 0.0008 |5| (inferred from a reported likelihood ratio of 0.99944). Under full-field, 
time varying light conditions, as well as spatiotemporally varying stixel simulations, we find 
average log-likelihood ratios of up to one order of magnitude larger - bounded above by 0.007. 
We can view this as a model of the checkerboard experiments of [5], for which close fits by 
PME distribution were also observed (likelihood numbers were not reported). Similarly, the 
values of A that are produced by our RGC model are close to those found by [5j[7] under 
comparable stimulus conditions. We obtain A = 99.5% (for cell group size N = 3) under 
constant illumination, which is near the range reported by |5| (97.8 - 99.2%, N = 3 - 7). 
For full-field stimuli we find a range of numbers from 95.7 — 99.3% (N = 3). 

The simple threshold models that we have developed, meanwhile, give us a roadmap for 
how circuits could be driven in such a way as to lower A. Figure [5p-F show A plotted 
as a function of firing rate for the data presented in Figure [5]/\-C: circuits of N = 3 cells 
receiving global common inputs. We observe that A ^ 1 for Gaussian and skewed inputs 
over a broad range of firing rates and pairwise correlation coefficients, but that values of A 
can be depressed by 10-15% in the presence of a bimodal common input. Indeed, Shlens 
et al. |5| showed that adding global bimodal inputs to a purely pairwise model can lead to 
a comparable departure in A. Our results are consistent with this finding, and explicitly 
demonstrate that the bimodality of the inputs — as well as their global projection — are 
characteristics that lead to this departure. 

While meaningful in an experimental study with non-neglible pairwise correlations, we 
caution that using A as a metric can be problematic when an idealized circuit is explored 
over its full range of parameters, because it may flag "uninteresting" cases in which cells are 
nearly independent, and a pairwise model adds little additional value. Specifically, if D in d 
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is small (i.e., the true distribution is well-approximated by the independent distribution 
Pi), then A may be appreciably far from 1 although D pair is small. Thus, a poor pairwise 
maximum entropy fit, as measured by A (that is, A < 1) is not necessarily indicative of a 
poor performance in Dkl{P, P)- For example, in the bimodal common input case (Figure 
[5]F), the very lowest values of A are achieved for low correlation p; in essence, when the 
independent model already does a good job of representing the output distribution. As 
suspected this performance is not reflected in Figure [5p, where low correlation gives low 
Dkl(P-> P)- In summary, A can be as low as 0.5 for distributions that are barely perceptibly 
different when measured by the Kullback-Leibler divergence. 

Figure [13]A-B extend our observations to a circuit of TV = 12 cells forced by Gaussian 
and skewed inputs respectively. We find that small A occurs for a <C 1, coincident with 
low population firing rates. Here P is nearly independent (because it is dominated by 
spiking events, the distribution is well-modeled by independent non-spiking neurons) and 
the improvement of the pairwise model over the independent model is negligible. The low 
population firing rate regime (Nv < 1, where v is the single cell firing rate per time bin and 
TV is the population size) is also the regime where 1 — A is linear in N — 2 | 17 1 . If a nontrivial 
deviation of A (from 1) is observed for Nv < 1 (such as, in Figure § TV = 3), then 1 - A 
must continue to grow as TV grows; equivalently, A must decrease. The growth of 1 — A with 



TV for particular points in this region is illustrated in Figure 13 



Using correlation structure to infer anatomical structure 

We address two questions about the relationship between the architecture of feedforward 
circuits and the statistical structure of the spike patterns that they produce, based on our 
comparisons between global-input and pairwise nearest-neighbor network architectures in 
the Results. 

First, if a circuit produces spike patterns that deviate substantially from pairwise maximum- 
entropy (PME) predictions, can we conclude that it has beyond-pairwise anatomical pro- 
jections — that is, common inputs received by more than two cells? For a small group of 
N = 3 cells, the answer is no: we find that, among all cases we study, the largest deviation 
from PME predictions occurs for purely pairwise (binary) inputs, so that departures from 
PME models do not imply departures from pairwise nearest-neighbor network architectures. 
For larger AT, the answer is a qualified yes: for marginal statistics of a given type, we show 
in Figure [TT] that the greatest deviations from PME models do correspond to global com- 
mon (as opposed to purely pairwise) inputs. However, without knowing input marginals, 
values of Dkl are still not predictive of anatomy: for example, if N = 16, then roughly the 
same values of D KL {P ) P) follow from global inputs with uniform marginals as for pairwise 
nearest-neighbor inputs with binary marginals. 

Second, if a circuit produces spike patterns that are well-described by PME models, does 
this imply that it has a pairwise architecture? Again the answer is no, as the success of 
PME models depends on N and marginal statistics. For A^ > 3 and known input marginals, 
better fits by PME imply pairwise nearest-neighbor connectivity; otherwise, such inferences 
cannot be made. 
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Scope and open questions 

Our first set of findings are for a set of circuit models with a simple thresholding nonlinearity 
at each cell. These models were chosen to be simple enough to allow analytical insights and 
a complete parametric study. 

While our retinal ganglion cell model demonstrates that these findings, based on a simple 
threshold model, do carry over to describe the spiking statistics of a more realistic spiking 
model (here, a time-dependent, nonlinear integrate-and-fire system), there are many aspects 
of circuits left unexplored by our study. Prominent among these is heterogeneity. Only a 
few of our simulations produce heterogeneous inputs to model RGCs, and all of our stud- 
ies apply to cells with identical response properties. This is in contrast to studies such 
as (71 which examine correlation structures among multiple cell types. For larger networks, 
feedforward connections with variable spatial profiles also occur, between the extremes of 
"nearest neighbor" and global input connections examined here. It is also possible that more 
complex input statistics could lead to greater higher-order interactions [37]. Finally, Figure 
11 indicates that some trends in D KL {P ) P) vs. N appear to become nonlinear for TV <J 10; 



for larger networks, our qualitative findings could change. 

Our study also leaves largely open the role of different retinal filters in generating higher- 
order interactions. We have found that the specific filtering properties of ON parasol cells 
suppress bimodality in light inputs, suggesting that other classes of retinal ganglion cells 
may produce more robust higher-order interactions (compare Figures 

|7p, E). This predicts 

a specific mechanism for the development of higher-order interactions in preparations that 
include multiple classes of ganglion cells |7|. Finally, we considered circuits with a single 
step of inputs and simple excitatory or gap junction coupling; a plethora of other network 
features could also lead to higher-order interactions, including multilayer feedforward struc- 
tures, together with lateral and feedback coupling. We speculate that, in particular, such 
mechanisms could contribute to the higher-order interactions found in cortex [8|[lO[[lTJ[38] . 

A final outstanding area of research is to link tractable network mechanisms for higher- 
order interactions with their impact (or lack of impact) on information encoded in neural 
populations |10,38|. A simple starting point is to consider rate-based population codes in 
which each stimulus produces a different "tuned" average spike count (see, e.g., Ch. 3 of |39|). 
One can then ask whether spike responses can be more easily decoded to estimate stimuli 
for the full population response (i.e., P) to each stimulus or for its pairwise approximation 
(P). In our preliminary tests where higher-order correlations were created by inputs with 
bimodal distributions, we found examples where decoding of P vs. P differed substantially. 
However, a more complete study would be required before general conclusions about trends 
and magnitudes of the effect could be made; such a study would include complementary 
approach in which the full spike responses P are themselves decoded via a "mismatched" 



decoder based on the pairwise model P 38 . Overall, we hope that the present paper, as 



one of the first that connects circuit mechanisms to higher-order statistics of spike patterns, 
will contribute to future research that takes these next steps. 
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Materials and Methods 



Dkl and distance from PME surface 

We can see that the curve along which /i and p are constant — the iso-moment line — remains 
a line in the (/ p ,/i p ,/i m ) coordinate space by inverting the equations \i — p\ + 2p 2 +£>3, 
P = P2 + £>3, and 1 = p + 3pi + 3p 2 + Ps- The first coordinate, f p =p +p 3 = 3(p - ji) — 1, 
is constant for a fixed (/x, p). Similarly, / lm and / lp can be written as linear functions of the 
remaining free parameter. 

To see that D KL (P, P) is convex along an iso-moment line, we consider Dkl(P, P) as P 
varies so as to remain on an iso-moment line. Letting / lm and fi p be the coordinates of the 
PME fit, and defining dm f\ m — / lm and dp fi p — we find 

dp = (/,-D/3 ix 

Si + (i - /„) 2 /9 

dm = — = rfx 
% 2 + (1 - / P ) 2 /9 



where rfx is an increment of distance along the iso-moment line. Inverting Equations [3] and 
substituting the results into the definition of D KL) we can write 

D KL (P,P) = 




(10) 

This is a convex function of dx; we can see this by observing that each of the four terms is 
a function of the form 

F( ix) = a (l + |),og(l + ^) 

(in the first term, for example, we have a = f p (l — fi p ) and /3 = —(1 — fi p )). This can be 
readily shown to be convex by taking the second derivative with respect to dx and verifying 
that it is positive: 

F"(dx) - a 



P 2 ( 1 + f 
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where we can verify that a > and \dx\ < /3. The sum of convex functions is likewise convex. 
Because Dkl(P,Q) is non-negative for any distributions P and Dkl(P,P) achieves its 
unique minimum along an iso-moment line at P = P, and it must monotonically increase as 
a function of \dx\. 

To get an intuitive picture of D KL {P ) P) as it varies in the /i m )-coordinate space, 

we view this quantity along constant-/^ slices (Figure [2]). D KL (P, P) increases with distance 
from the constraint curve. Generally, the range of this distance peaks at f p = 0.25. 

To further quantify the relationship between distance and D KLl we approximate the 



logarithms in Equation 10 for small arguments and find that Dkl increases quadratically 



with distance for small arguments: 

D KL (P,P) « (dx) 2 C(f p J lm ) + 0(dx 3 ) (11) 

where 

f p (l - f P Y/9 (1 - 3/ lm + 3/L) 2 



C(f p , flm) 



fp 2 (l"/p) 



PM-U) 1 



P P +^-U) 2 /$hm{l-hm) 



Numerical sampling of 3 cell network 

For general circuit set-ups, it may be necessary to probe the output distribution by sampling. 
In the case of global input, however, it is more computationally efficient and accurate to 
compute the output spiking probability distribution using quadrature. To be concrete, a set 
of TV = 3 threshold spiking units is forced by a common input I c (drawn from a probability 
distribution Pc{y)) and an independent input Ij (drawn from a probability distribution 
Pj(y)). The output of each cell xj is determined by summing and thresholding these inputs: 

Xj = H(Ij + I c - 6) 

Conditioned on I C) the probability of each spike is given by: 

Prob[xj = 1 I I c = a] = Prob[7j + a - 

= Probf^ > 6 

poo 

= / Pi(y)dy 

JS-a 

Similarly, we have the conditioned probability that xj = 0: 

Probfo = I J c = a] = Prob^ + a - 6 < 0] 

= Prob^ < 9 - a] 

/B-a 
Pi(y)dy 
-00 



-e > 0] 

— a] 
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Because these are conditionally independent, the probability of any spiking event (xi, x 2 , x 3 ) = 
(^1,^2, As) is given by the integral of the product of the conditioned probabilities against 
the density of the common input. 

/oo 3 
dy P c (y) [J Prob ta = a j \h = y] (12) 
3=1 

The integrand in the previous equation is numerically evaluated via an adaptive quadrature 
routine, such at Matlab's quad. This is easily generalized to an arbitrary number of cells N. 

Unimodal inputs Ij , I c were chosen from different marginals with mean and variance 
a 2 . For Gaussian input, P(x) oc e -x2 / 2(j2 ; for uniform inputs, P(x) oc 1 for \x\ < V3a 2 and 
otherwise. For skewed input, P{x) oc (x + fi)e~( x+ ^ / 2a , for x > — /x, where the parameter a 
sets the variance 2a(l — |) and shifting by \i = y^y ensures that the mean of P{x) is zero. 

When sampling was necessary in thresholding models, for TV < 14 the number of samples 
was chosen so that the mean bin occupancy of each unique circuit output was > 100 (and 
usually > 1000). For TV = 16 — the most poorly resolved case — the mean bin occupancy 
was lower (40). 



Bimodal triplet inputs always generate distributions with negative 
strain 

Another information theoretic quantity that relates to the ability of a distribution to be 
characterized by a pairwise model is the strain: 

1 / P(1,1,1)P(1,0,0)P(0,1,0)P(0,0,1) \ 
7 8 ° g U(0,0,0)P(l,l,0)P(l,0,l)P(0,l,l)y'- 

Indeed, this quantity must be zero for any distribution that satisfies the PME constraint 
(Equation [2]). Negative values of strain occur to the left side of the PME constraint curve 
in the (f\ p) /i m )-plane, whereas positive values occur to the right. 

For a circuit forced by common binary inputs, the simplicity of our setup allows us to 
show why observed distributions occur to the left side of the PME constraint curve. We 
approach this by showing that given the / lm coordinate of an observed distribution, the 
fi p coordinate is less than the PME fit would predict. A point on the constraint surface 
corresponding to a particular value of / lm may be written 

f 3 

J lm 

1 3/i m + 3/f m 



q 3 + (1 - qf 
whereas 

^ = PQ^_ 

lp pq 3 + (1 - p) + p(l - q) 3 
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and 

J_ = l-p + p(q 3 + (l-q) 3 ) 

fi P PQ 3 
1-p 1 

pq 3 hp 

This makes it clear that -J- > -j- with equality if and only if p = 1. Therefore 

jip tip 

hp < hp (13) 

unless p — 1, in which case they coincide. 

According to Equation [13J po is under-predicted by the PME model, whereas p 3 is over- 
predicted; this is precisely the condition of "sparse coding" [TT|[25] . 



An analytical explanation for unimodal vs. bimodal effects 

We consider an analytical argument to support the our numerical results that bimodal inputs 
generate larger deviations from PME model fits than unimodal inputs. As a metric, we 
consider D KL {P ) P) — where P and P are again the true and model distributions respectively 
- when we perturb an independent spiking distribution by adding a common, global input 
of variance c. To simplify notation, the small parameter in the calculation will be denoted 

We begin by observing that when P is a maximum entropy distribution; that is, its 
logarithm is given as a sum over functionals whose averages over P must also be satisfied by 
P, or 

log(p(x)) = 5> M / M (x)-logZ, ]>>(x)/ M (x) = ]>>(x)i- M (x); 

the KL-distance may be written as a difference of entropies: 
D KL (P,P) = £^)log(|[g) 

= J]p( x ) !og(p( x )) - ^2p(x) log(p(x)) 

X X 

= -S(P)-£p(x)(-lQgZ + £A,/ M (x)j 
= -5(P)+logZ-^A M ^p(x)/ /i (x) 

fl X 

= -5(P)+logZ-^A^p(x)/ M (x) 

fl X 

= -S(P) - ]>>(x) log(p(x)) 

X 

= -S(P) + S(P). 
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Here, the entropy of a probability distribution P is given 



S{P) = -p log(po) - 3pi log(pi) - 3p 2 log(p 2 ) - Ps log(p 3 ) (14) 

if we use the fact that the distributions are permutation-symmetric (i.e. p\ = P(1,0,0) = 
P(0, 1,0) = P(0,0, 1)). We take the logarithms in the definitions of entropy S and KL- 
divergence D KL to be base 2, so that any numerical values of these quantities are in units 
of bits. 

We now compute S(P) and S(P) by deriving a series expansion for each set of event prob- 
abilities. We can compute the true distribution P using the expressions derived in Equation 



12; to recap, let the common input have probability density p(c), and the independent input 



to each cell have density p s (x). Let 9 be the threshold for generating a spike (i.e., a "1" 
response). For each cell, a spike is generated if x + c > i.e., with probability 

d(c) = / p s (x)dx . 
Je-c 

Given c, this is conditionally independent for each cell. We can therefore write our proba- 
bilities by integrating over c as follows: 

/CO 
p(c)(l -d{c)fdc 
-co 

/CO 
p(c)d(c)(l -d{c)fdc (15) 
-co 

/CO 
p(c)d(c) 2 (l-d(c))dc 
-CO 

/CO 
p(c)d(cf dc 
-co 



We develop a perturbation argument in the limit of very weak common input. That is, p(c) 
is close to a delta function centered at c = 0. Take p(c) to be a scaled function 

P(c) = -jQ (16) 

We place no constraints on /(c), other than that it must be normalized (E[l] = 1) and 
that its moments must be finite (so E[c], E[c 2 ], and so forth must exist, where E[g(c)] = 

A //('•>/(<•> '/'•>• 

For the moment, assume that the function f(c) has a single maximum at c = 0. To 
evaluate the integrals above, we Taylor-expand d{c) around c = 0. Anticipating a sixth- 
order term to survive, we keep all terms up to this order. This gives, for small y, 

6 

d(y)*d(0) + J2 a ky k + 0(y 7 ) 
k=i 
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where a\ = p s (9) (the other coefficients a 2 -a 6 can be given similarly in terms of the indepen- 
dent input distribution at 9). Substituting this into the expressions for p 0j etc., above, with 
p(c) given as in Equation fl6j gives us each event as a series in e; for example, 



Ps = d 3 + (3a 1 d 2 E[c])e+ ((3a^ + 3a 2 rf^ E[c 2 ]) e 2 + ... 
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The entropy S(P) is now given by using these series expansions in Equation 

We note that our derivation does not rely on the fact that the distribution of common 
input is peaked at c = in particular. For example, we could have a common input centered 
around ji. The common input distribution function would be of the form 

p(c) = -f 



eve 



Changing e regulates the variance, but doesn't change the mean or the peak (assuming, 
without loss of generality, that the peak of / occurs at zero). The peak of p(c) now occurs 
at /x, and the appropriate Taylor expansion of d(y) is 

6 

k=l 

where the coefficients bk now depend on the local behavior of d around \i. The expectations 
that appear in the expansion of J93, and so forth, are now centered moments taken around 
/i; the calculations are otherwise identical. In other words, perturbation expansion requires 
the variance of the common input to be small, but not the mean. 

For bimodal inputs, we consider a common input with a probability distribution of the 
following form: 

so that most of the probability distribution is peaked at zero, but there is a second peak of 
higher order (here taken at c = 1, without loss of generality). Again, we approximate the 
integrals given in Equations 15, and therefore the entropy S(P) : by Taylor expanding d; 

6 

d(c) w d(0) + J2 a kC k + O(c 7 ); (c«0) 

6 

~ rf(l) + ^6,(c-l) fc + 0((c-l) 7 ); (c«l) 



around the two peaks and 1 respectively. For each integral we have the same contributions 
from the unimodal case, multiplied by (1 — e 2 ), as well as the corresponding contributions 
from the second peak multiplied by e 2 (these weightings are chosen so that the common 
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input has variance of order e 2 , as in the unimodal case). This makes clear at what order 
every term enters. 

We now construct an expansion for the PME model P: 

P(x 1) x 2) x 3 ) = \- exp (Ai(xi + x 2 + £3) + X 2 (xix 2 + x 2 x 3 + £1X3)) 

Zj 

We approach this problem by describing Ai and A2 as a series in e. We match coefficients by 
forcing the first and second moments of P to match those of P — as they must. Specifically, 
take 



Xi = A + ^e^ + 0(e 7 ) 
k=i 

6 

A 2 = 5>%, + 0(e 7 ) 



k=l 

where Ai = A, A 2 = are the corresponding parameters from the independent case. The 
events p 0j Pi •> P2 and p 3 can be written as a series in e. We then require that the mean and 
centered second moments of P match those of P; that is 

pi + 2p 2 +p 3 = p 1 + 2p 2 + p 3 

P2+P3- (Pi + 2p 2 + p 3 ) 2 = p 2 +p 3 - (pi + 2p2 + Ps) 2 - 

At each order fc, this yields a system of two linear equations in Uk and Vk] we solve, inductively, 
up to the desired order; we now have P, and therefore S(P), as a series in e. 
Finally, we combine the two series to find that in the unimodal case, 



D KL (P,P) = S(P)-S(P) 



q6(2E[c] 3 -3E[c]E[c 2 ]+E[c 3 ]) 2 
2(1 -d fd 3 



+ 0(e 7 ) 



(17) 



If the first two odd moments of the distribution are zero (something we can expect for 
"symmetric" distributions, such as a Gaussian), then this sixth-order term is zero as well. 
For the bimodal case 



D KL (P,P) = S(P)-S(P) 

(di - d Y 



2(1 - d fdl 



+ 0(e 5 



This last term depends on the distance d\ — do, i n other words, how much more likely the 
independent input is to push the cell over threshold when common input is "ON". We 
can also view this as depending on the ratio , which gives the fraction of previously 
non-spiking cells that now spike as a result of the common input. 

The main point here, of course, is that Dkl(P, P) is of order e 4 rather than e 6 . So, as 
the strength of a common binary vs. unimodal input increases, spiking distributions depart 
from the PME more rapidly. 
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Experimentally-based model of a RGC circuit 

We model the response of a individual RGC using data collected from a representative 



primate ON parasol cell, following methods in 32,35 . Similar response properties were 
observed in recordings from 16 other cells. To measure the relationship between light stimuli 
and synaptic conductances, the retina was exposed to a full-field, white noise stimulus. The 
cell was voltage clamped at the excitatory (or inhibitory) reversal potential Ve — mV 
(Vj = —60 mV), and the inhibitory (or excitatory) currents were measured in response to 
the stimulus. These currents were then turned into equivalent conductances by dividing by 
the driving force of ±60 mV; in other words 

jexc = g exc(y _ y^. V ~ V E = "60 HlV 

Jinh = g inh(y _ y^. y _ ^ = g Q m y 

The time-dependent conductances g exc and g inh were now injected into the same cell using 
a dynamic clamp (i.e., input current was varied rapidly to maintain the correct relationship 
between the conductance and the membrane voltage) and the voltage was measured at a 
resolution of 0.1 ms. 

To model the relationship between the light stimulus and synaptic conductances, the 
current measurements 7 exc and 7 mh were fit to a linear-nonlinear model: 

7 exc (t) = A^ exc [L exc *5(t)+ry exc ], 
I inh (t) = N inh [L inh * s(t) + r] inh ] 

where s is the stimulus, L exc (L mh ) is a linear filter, 7V exc (A^ mh ) is a nonlinear function, and 
^exc ^mhj j g a no j se term. The linear filter was fit by the function 

£ exc (t) = Pexc(t/r e xc) nexc exp(-t/r exc )sin(27rt/r exc ) (18) 
and the nonlinear filter by the polynomial 

L mh and 7V mh were fit using the same parametrization. The noise terms t^ xc , rf£ h were fit to 
reproduce the statistical characteristics of the residuals from this fitting. We simulated the 
noise terms /y exc and r] mh using Ornstein-Uhlenbeck processes with the appropriate parame- 
ters; these were entirely characterized by the mean, standard deviation, and time constant 
of autocorrelation t v ^ c (r^^h), as well as pairwise correlation coefficients for noise terms 
entering neighboring cells. The noise correlation coefficients were estimated from the dual 



recordings of 32 



Linear filter parameters computed were P exc = — 8 x 10 4 pA/s, n exc = 3.6, r exc = 12 
ms, T exc 105 ms, and P in h — 1.8 x 10 5 pA/s, n in h 3.0, r in h 16 ms, T exc 120 
ms. Nonlinearity parameters were A eyiC = — 5 x 10 -5 , £> exc = 0.42, C exc = —57, and A-^ = 
1 x 10 4 , £> in h = 0.37, Ci n h = 250. Noise parameters were measured to be mean(^ xc ) = 30, 
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std(^ xc ) = 500, = 22 ms, and mean(?7j nh ) = -1200, std(C h ) = 780, r^ inh = 33 ms. In 
addition, excitatory (inhibitory) noise to different cells rjl xc , ?y| xc {rj™ h , rf-^) had a correlation 
coefficient of 0.3 (0.15). 

To obtain the rectified filter shown in Figure [7^, we simply took the absolute value of 
the filter function; i.e. L exc > R (t) = \L exc (t)\. 

Model fitting 

We create a model of the cell as a nonlinear integrate-and-fire model using the method of 
Badel et al. |33|, in which the membrane voltage is assumed to respond as 

^ = F(V,t-t last ) + 1 ^^ (19) 

where C is the cell capacitance, ti ast is the time of the last spike before time £, and Ii npu t(t) 
is a time-dependent input current. We use the current-clamp data, which yields cell voltage 
in response to the input current Ii npu t(t) = g eyLC (t)(V — Ve) + g mh (V — V/-), to fit a function 
F(V, t). When voltage data is segregated according to the (binned) time since the last spike, 
the I — V curve is well fit by a function of the form 

F(V) = —(E L -V + A T e {y - VT)/AT ) (20) 

The membrane time constant r m , resting potential (El), spike width and knee of the 
exponential curve Vt are parameterized as a function of t — ti ast . Our model neuron comprises 



Equations (19, 20) for V < Vthreshoid] a spike was detected when V reached V t hreshoid = — 30 
mV, with a voltage reset to V reset = —55 mV after 2 ms. In addition, the cell was unable to 
spike for an absolute refractory period of r a 5 S = 3 ms. 

The capacitance was inferred from the voltage trace data by finding, at a voltage value 
where the voltage/membrane current relationship is approximately Ohmic, the value of C 



that minimizes error in the relation Equation 19 |33|. The estimated value was C = 28 pF. 



Recurrent model 

Gap junction coupling was introduced as an additional current on the right-hand side of 
Equation [I9j 

The coupling strength g gap was held constant during a simulation. When coupling was 
present (i.e. when g gap ^ 0), g gap was varied from a baseline level (1.1 nS) |32| to 16 times 
this value (17.6 nS) between simulations. 

For simulations that include electrotonic coupling, the spike trajectory was modeled with 
greater care. A typical (averaged) spike waveform was extracted from voltage traces of a 
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primate ON parasol cell. The spike waveform was used to replace 1 ms of the membrane 
voltage trajectory during and after a spike; at the end of the 1 ms, the voltage was released 
at approximately —58 mV. The cell was unable to spike for an absolute refractory period of 
Tabs = 3 ms. A relative refractory period was induced by introducing a declining threshold 
for the period of 3-6 ms following a spike, after which V t h limits to -30 mV. 



Cell receptive field 

We defined each cell's stimulus as the linear convolution of an image with its receptive field. 
The receptive fields include an "on" center and an "off" surround, as in |40|: 

Sj(x) exp ^— — Xj)Q(x — Xj^j — fcexp ^— ^r(x — Xj)Qr(x — Xj)^j 

where the parameters k and 1/r give the relative strength and size of the surround. Q 
specifies the shape of the center and was chosen to have a 1 standard deviation (SD) radius 
of 50 /xm and to be perfectly spherical. The receptive field locations xi, x 2 , and x 3 were 
chosen so that the 1 SD outlines of the receptive field centers will tile the plane (i.e. they 
just touch). Other parameters used were k = 0.3, r = 0.675. 



Numerical methods and convergence testing 



For each simulation, Equations 19, 20 were integrated using the Euler method for > 10 5 ms 
with a time step of 0.1 ms. For each stimulus condition, 20 simulations (or sub-simulations) 
were run, for a total integration time of > 20 x 10 5 ms. The synaptic noise terms, ^ xc and 
?^ nh , as well as the light input, were generated independently for each sub-simulation. To 
discretize spiking outputs, 5 ms bins were used. 

These 20 sub-simulations were used to estimate standard errors in both the probabil- 
ity distribution over spiking events and D KL {P ) P). For example, in the constant light 
case, we generated the following distribution on spiking events: P(0, 0, 0) = 0.816 ± 0.004, 
P(0,0,1) = 0.0457 ± 0.0015; P(0,1,0) = 0.0448 ± 0.0015, P(1,0,0) = 0.0459 ± 0.0017, 
P(0, 1, 1) = 0.00554±0.00054, P(l, 0, 1) = 0.00545±0.00051, P(l, 1, 0) = 0.00545 ±0.00056, 
and P(l, 1, 1) = 0.00116 ± 0.00020. Numbers reported in the Results are, unless specified 
otherwise, produced by collating the data from the 20 simulations. 

To test our finding that the observed distributions were well-modeled by the PME fit, we 
also performed the PME analysis on each of the 20 simulations for each stimulus condition. 
While in general Dkl{P) P) can be quite sensitive to perturbations in P, the numbers re- 
mained small under this analysis. To confirm that our results for Dkl{P, P) are sufficiently 
resolved to remove bias from sampling, we performed an analysis in which we collect the 20 
simulations in subgroups of 1, 2, 4, 5, 10, and 20, and plot the mean D KL with estimated 
standard errors. As expected (e.g. [4l]), bias decreases as the length of subgroup increases 
and asymptotes at — or before — the full simulation length. Results are shown in Figure 



SI for the RGC simulations under full-field stimulation, as well as two representative cases 



with "stixeF stimuli. 
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For the thresholding model, sampling was used for pairwise input cases in Figure [TTJ 
and all data presented in Figure [12} In all but three simulations, the sampling frequency 
was chosen so that the mean bin occupancy (that is, the average frequency of each unique 
spiking pattern) was > 100 (and usually > 1000). The exceptions are for the cases where 
the network size N = 16 and unimodal pairwise inputs are used, for which the mean bin 
occupancy was « 40. We show the bias for TV = 16, pairwise Gaussian inputs in Figure |ST]A, 
which confirms that the bias is close to its asymptotic value of zero. 

To provide a cross-validation test, we divided our data into halves (which we denote Pi 
and P 2) each including data from 10 subsimulations) and performed the PME analysis on 
one half (say Pi) to yield a model Pi. We then computed D KL (P 2) Pi) and D KL (P 2) Pi) 
(as in |42]), which we refer to the cross-validated and empirical likelihood respectively. The 
former tests whether the PME fit is robust to over-fitting; the latter tests how well-resolved 
our "true" distribution is in the first place. In Figure S2, we plot these numbers vs. the 
numbers reported in the paper. Most cross- validated likelihoods fall on or near the identity 
line; most empirical likelihoods are close to zero (and importantly, significantly smaller than 
either D KL (P,P) or D KL (P 2) Pi) ) indicating that D KL (P,P) is accurately resolved). We 
conclude that the deviations that we observe when these conditions are met can not be 
accounted for by the differences in testing and training data. 

Finally, to test the robustness of our finding that the strain 7 < for the full-field 
RGC simulations, we perturbed our spiking event distributions randomly, with perturbations 
weighted by the estimated standard errors. This was repeated 20 times for each stimulus 
condition. Out of the resulting 840 perturbations (20 each for 42 stimulus conditions), 7 > 
in only 22 trials. 
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Figure 1: Schematic showing different axes on which we explore microcircuits in this study. 
(Top left) Network architecture: global vs. pairwise inputs and scaling up system size 
N. (Top right) Presence vs. absence of reciprocal coupling. (Bottom left) Input statistics: 
unimodal vs. bimodal marginal statistics. (Bottom right) varying strength of common input. 
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Figure 2: Geometrical organization of D KL (P,P) within the space of three-cell spiking 
distributions. Outer plots: Slices of D KL (P,P) along surfaces f p = constant. Cen- 
ter: schematic of the (f pi fi p , fi m ) coordinate space. Counterclockwise from lower right: 
f p = 0.1,0.25,0.58,0.76,0.9. f p = 0.25 contains the maximal attainable Dk^P^P) over 
all admissible P. f p = 0.616 contains the maximal attainable D KL from pairwise bimodal 
inputs (see Results). 
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Figure 3: Examples of spiking distributions with small, intermediate, and large deviations 
from PME models (as quantified by the KL- divergence, D KL {P ) P), between the observed 
distribution P and its PME fit P). (Top row) Bar plot contrasting three distributions with 
their pairwise maximum entropy (PME) fits. The probability shown is the total probability 
of all events with a certain number of spikes. From left: global skewed input, global binary 
input, XOR operator. D KL (P,P) is, from left, 0.0013 (skewed), 0.091 (bimodal), and 1 
(XOR). (Middle row) The same distributions (crosses) projected into the (/i m , /i p )-plane 
and their corresponding PME fits (circles). The cyan line is the iso-moment line of all 
distributions with the same first and second moments, and the black curve is the PME 
constraint surface. (Bottom row) Dkl{P, P) along the iso-moment line (cyan solid) and the 
quadratic approximation derived in the Methods, (black dashed). 
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Figure 4: Deviation from PME fit for circuits receiving independent and (global) common 
input. Results shown for TV = 3; (A) Gaussian, (B) uniform, (C) skewed, and (D) bimodal. 
For each choice of marginal input statistics, possible input parameters are varied over a 
broad range as described in the Results; over c G [0, 1], a G [0, 4], and G [—1, 3] (unimodal 
inputs), over p G [0, 1] and q G [0, 1] (bimodal inputs). (Left column) Schematic of input dis- 
tributions. (Center column) Projection of all distributions onto the /i m )-plane. (Right 
column) For the value of for which the maximal value is achieved, a slice of D KL (P,P) 
(unimodal inputs); contour plot of D KL (P, P) where 1 < < 2 (bimodal inputs). 
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Figure 5: Relationship between measures of higher-order interactions and other output firing 
statistics. (A-C) D KL (P, P) versus firing rate E[xj], for data obtained from N = 3 threshold- 
ing cells. For each choice of marginal input statistics, possible input parameters were varied 
over a broad range as described in the Results; over cG [0,1], a G [0, 4], and B £ [—1, 3] for 
unimodal inputs, and over p £ [0, 1] and q £ [0, 1] for bimodal inputs. In each panel, data 
is organized by p, from dark (p £ (0,0.1)) to light (p £ (0.9, 1)); (A) Gaussian inputs, (B) 
skewed inputs, (C) bimodal inputs. (D-F) The fraction of multi-information captured by the 
PME model, A, versus firing rate E[xi], for these same distributions. Data is organized by 
correlation coefficient p, as in panels (A-C); D) Gaussian, E) skewed, F) bimodal. 
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Figure 6: Effect of recurrent coupling on PME fits of circuit outputs for the thresholding 
model. Left panel: scatter plot of D KL achieved with Gaussian inputs vs. coupling strength 
g : for TV = 3 cells. Right panel: another view of D KL (P,P) vs. coupling strength g : for 
N = 3 cells. In the top three panels, for unimodal inputs, c, a, and were fixed at a single 
representative value, while g was varied. The top panel shows the results from circuits with 
Gaussian inputs (global and pairwise inputs are equivalent) and the next two for skewed and 
uniform inputs with either global (thin lines) or local (heavy lines) inputs. For all unimodal 
inputs, (c, a, B) were (0.5, 1, 1). Analogous results from circuits with either global (thin line) 
or local (heavy line) bimodal inputs are shown in the bottom panel. Here, p, q : and were 
fixed at a single representative value, while g was varied. The parameters (p, g, 0) were 
(0.81, 0.5, 1.5) and (0.5, 0, 1.5) for global and pairwise inputs respectively. 
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Figure 7: Composite of results for RGC simulations with constant light and full field flicker. 
In rows (A-C), we have (left) a histogram and time series of stimulus, (center) a histogram 
of excitatory conductances (multiplied by driving voltage —60 mV, and therefore in units 
of pA) and (right) the resulting distribution on spiking patterns. (A) Gaussian noise only. 
(B) Gaussian input, standard deviation 1/6, refresh rate 8 ms. (C) Binary input, standard 
deviation 1/3, refresh rate 8 ms. (D-E) Binary input, standard deviation 1/3, refresh rate 
100 ms. In the top left panels, the excitatory filter L exc (i) is shown instead of a stimulus 
histogram; in the bottom left panels, the normalized excitatory conductance (in pA — red 
dashed line) is superimposed on the stimulus (blue solid). (D) Filter fit from data, parameters 
given in Methods. Both the filter and conductance trace illustrate that the LN model that 
processes light input acts as a (time-shifted) high pass filter. (E) As in (D), but with rectified 
filter, producing a bimodal distribution of conductances. 




Figure 8: Comparison of RGC simulations computed with the original ON parasol filter, 
vs. simulations incorporating a rectified filter. (A) Dkl(P) P) for original vs. rectified 
filter. Data is organized by stimulus refresh rate (8, 40, and 100 ms) and marginal statistics 
(Gaussian vs. binary). (B-C) Histograms of excitatory conductances (multiplied by driving 
voltage —60 mV, and therefore in units of pA) for representative simulations, under original 
(left) and rectified (right) filters. (B) Binary stimulus with standard deviation of 1/4 and 
refresh rate 100 ms. The histogram shows a significant shift towards a bimodal structure, 
with a corresponding increase in D KL (P, P). (C) Binary stimulus with standard deviation of 
1/2 and refresh rate 8 ms. The histogram shows little qualitative change under the change 
in filter, and little change in D KL (P, P). 
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Figure 9: Results for RGC simulations with light stimuli of varying spatial scale ("stixels"). 
With the exception of (A), we show data from the binary light distributions; results from 
the Gaussian case are similar. (A) Average D KL (P,P) as a function of stixel size. Values 
were averaged over 5 stimulus positions, each with a different (random) stimulus rotation 
and translation; 512 /im corresponds to full field stimuli. (B,C) Probability of singlet and 
doublet spiking events, under stimulation by movies of 256 fim (B) and 60 fim (C) stixels. 
Event probabilities are plotted in 3-space, with the x, y ) and z axes identifying the singlet 
(doublet) events 001 (Oil), 010 (101), and 100 (110) respectively. The black dashed line 
indicates perfect cell-to-cell homogeneity (e.g. P(1,0,0) = P(0, 1,0) = P(0,0, 1)). Both 
individual runs (dots) and averages over 20 runs (large circles) are shown, with averages 
outlined in black (singlet) and gray (doublet). Different colors indicate different stimulus 
positions. (D-F) Results for one stimulus position, with stixel size 60 fim. (D) Contour lines 
of the three receptive fields (at 0.5, 1, 1.5 and 2 SD; and at the zero contour line) superim- 
posed on the stimulus checkerboard (for illustration, pictured in an alternating black/white 
pattern). (E) Marginal distributions of the excitatory conductances, for each cell. (F) Spike 
pattern distribution; the three different probabilities labeled pi correspond to, e.g, P(l, 0, 0), 
P(0, 1,0), and P(0,0, 1)), demonstrating heterogenous responses among the RGCs. (G-I) 
As in (D-F), but for stixel size 256 /xm. 
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Figure 10: The impact of recurrent coupling on RGC networks with full-field visual stimuli. 
The strength of gap junction connections was varied from a baseline level (relative magnitude 
g = 1, or absolute magnitude g gap = 1.1 nS) to an order of magnitude larger (g = 16, 
or g gap = 17.6 nS). (A) D KL (P,P) obtained from two independent simulations without 
coupling, for each of 42 different stimulus ensembles (deviation of this data from the line 
y = x thus gives a control for the statistical variability of the results in later panels). (B-F) 
Dkl(P-> P) obtained with coupling, plotted versus the value obtained for the same stimulus 
ensemble without coupling, for each of 42 different stimulus ensembles. 



47 



0.16 
0.14 



-jfc- pairwise, bimodal global, bimodal 
-0- pairwise, uniform -0- global, uniform 
A pairwise, skewed global, skewed 
♦ pairwise, gaussian — global, gaussian 



0.4 
0.2 





Observed 
PME 



-A 

▼ 

8 





fl Jl .n In In L In In In m Jl n n 


4 8 12 16 






4 8 


12 16 



4 8 12 16 



N 



10 12 14 16 



0.2 
).1 

'o 





li.. 


4 8 1 


2 16 



Figure 11: Maximal deviation from PME fit for thresholding circuit model, forced by either 
global or local input against background noise, for increasing network size N. For each TV, 
possible input parameters are ranged as described in the Results: over c G [0, 1], a G [0,4], 
and G [—1,3] (unimodal inputs), over s G [0,1] and p G [0,1] (bimodal inputs). The 
sidebar shows sample distributions with maximal D KL (P,P) for TV = 16; from top, global 
bimodal inputs, pairwise bimodal inputs, global Gaussian inputs, and pairwise Gaussian 
inputs. 
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Figure 12: Maximal deviation from PME fit for circuit forced by either global or local 
input against background noise and either global or local recurrent coupling. Plot shows 
the maximal D KL achieved over all parameters as TV increases. For each AT, possible input 
parameters are ranged described as in the Results; over c G [0, 1], a G [0, 4], and G [—1, 3] 
(unimodal inputs), over s G [0, 1], p G [0, 1], and B G [0.3, 1.9] (bimodal inputs). In addition, 
recurrent coupling strength g is varied over [0,2.4]. The sidebar shows sample distributions 
with maximal D KL {P ) P) for TV = 8; from top, global bimodal inputs with global coupling; 
pairwise bimodal inputs with global coupling; global Gaussian inputs with pairwise coupling. 
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Figure 13: The fraction of multi-information captured by the PME model, A, is low for 
some network parameters as population size increases. Here, we show data obtained from 
feedforward networks of N = 12 thresholding cells. For each choice of marginal input 
statistics, possible input parameters were varied over correlation coefficient c G [0, 1] and 
input variance a G [0,4]. The threshold was held at = 1. (A-B) A for (A) Gaussian and 
(B) skewed input, for TV = 12. (C) 1 — A vs. TV for Gaussian, skewed and uniform inputs at 
a fixed value of c and a (c = 0.56; for Gaussian and skewed, a = 0.2; for uniform, a = 0.6). 
Because of the low firing rate, 1 — A must grow linearly with TV (see text). 
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Figure SI: Mean and estimated standard errors of D KL {P ) P), as a function of subgroup size. 
The 20 simulations for each circuit condition (see Materials and Methods) were collected into 
subgroups of M — 1,2, 4, 5, 10, and 20. M = 20 corresponds to the full simulation length — 
2 x 10 6 ms in (B), 2 x 10 5 ms in (C,D) — reported in the text. As expected, bias decreases as 
the length of subgroup increases and asymptotes at — or before — the full simulation length. 
(A) = 16, Gaussian pairwise inputs, for the sum-and-threshold model. (B) Full-field RGC 
simulations. (C) Spatially variable RGC simulations, binary stimulus, stixel size = 4/im. 
Different colors signify different positions of stimulus relative to receptive field. (D) As in 
(C), but stixel size = 256 fim. 
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Figure S2: Cross-validated and empirical values of D KL {P ) P) ) vs. values reported in the 
paper. (A) Full field, (B) Rectified, (C) Stixel simulations with binary inputs, (D) Same as 
(C), zoomed into origin. 
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